Introduction to Artificial Neural Networks

Authors

Esteban Vegas

Ferran Reverter

Alex Sanchez

Published

May 6, 2024

From Artificial Neurons to Artificial Intelligence

In the post-pandemic world a major phenomenon is impacting society: the lightning-fast rise of Artificial Intelligence, with its tangle of realities and promises.

Although the field had been growing steadily, it was probably the appearance of ChatGPT, followed by other LLMs, that gave everybody an experience, an opinion, or a fear on the topic.

Most tasks performed by AI can be described as Classification or Prediction, used in a wide range of applications, from recommender systems and natural language processing to image recognition and image generation.

AI engines use statistical learning methods, such as machine learning algorithms, to make predictions based on large amounts of data.

However, it is important to keep in mind that AI has far-reaching implications beyond its predictive capabilities, including ethical, social, and technological considerations that must be taken into account when developing and deploying AI systems.

The rise of Deep learning

Today, when technically trained people think about AI, one of the first concepts that comes to mind is Deep Learning.

Deep learning is a highly successful model in the field of AI, which has powered numerous applications in various domains. It has shown remarkable performance in tasks such as image recognition, natural language processing, and speech recognition.

Deep Learning, indeed, has been developed in the field of Artificial Neural Networks (ANNs), but it extends the basic principles of ANNs by introducing more complex architectures and algorithms and, at the same time, by enabling machines to learn from large datasets, automatically identifying relevant patterns and features without explicit programming.

One key advantage of deep learning over traditional machine learning algorithms is its ability to handle high-dimensional and unstructured data such as images, videos, and audio.

The early history of artificial neural networks and artificial intelligence

A Brief History of AI from 1940s till Today

The origins of Deep learning and modern Artificial Intelligence can be traced back to the perceptron. Source: “A Quick History of AI, ML and DL”

The origins of AI, and as such of DL, can be traced back almost one century. While it is an interesting, even fascinating, history, we don’t go into it here (see a summary in A Quick History of AI, ML and DL).

We can see there, however, several milestones worth accounting for, because we will go through them to understand how a deep neural network works. These are:

  • The Perceptron and the first Artificial Neural Network where the basic building block was introduced.

  • The Multilayered perceptron and back-propagation where complex architectures were suggested to improve the capabilities.

  • Deep Neural Networks, with many hidden layers, and auto-tunability capabilities.

In short, there has been a mathematical and a technological evolution that, at some point, brought together:

  • The required theoretical background (DNN)

  • The required computational capabilities (GPU, HPC)

  • The required quantity of data (Big Data, Images, Social Networks)

This has resulted in making artificial intelligence widely accessible to businesses, researchers, and the general public.

Why Deep Learning Now? Source: Alex Amini’s ‘MIT Introduction to Deep Learning’ course

Success stories such as

  • the development of self-driving cars,

  • the use of AI in medical diagnosis, and

  • the creation of personalized recommendations in online shopping

have also contributed to the widespread adoption of AI.

AI also comes, however, with fears fed by multiple sources, from science fiction to religion:

  • Mass unemployment

  • Loss of privacy

  • AI bias

  • AI fakes

  • Or, simply, AI takeover

Comparison with Traditional Machine Learning

A reasonable question is: “How are Artificial Intelligence, Machine Learning and Deep Learning related?”

A standard answer can be found in the image below that has a myriad variations:

We can keep, for instance, the following three definitions, which also have many variants:

  • Artificial intelligence is the ability of a computer to perform tasks commonly associated with intelligent beings.

  • Machine learning is the study of algorithms that learn from examples and experience, instead of relying on hard-coded rules, to make predictions on new data.

  • Deep learning is a sub-field of machine learning focusing on learning data representations as successive layers of increasingly meaningful representations.

We will come back to the difference between ML and DL, but we can already highlight two strengths of DL that differentiate it from ML:

  • DNNs combine feature extraction and classification in a way that does not require (or dramatically decreases the need for) human intervention.
  • The power of DNNs resides in their ability to keep improving as more data become available, without seemingly reaching the performance plateaus that ML methods do.

An illustration of the performance comparison between deep learning (DL) and other machine learning (ML) algorithms, where DL models trained on large amounts of data can keep increasing their performance

Deep learning is having a strong impact on many day-to-day activities, which explains the growing interest in the topic at all levels, not only technical or scientific ones. Examples include:

  • Near-human-level image classification

  • Near-human-level speech transcription

  • Near-human-level handwriting transcription

  • Dramatically improved machine translation

  • Dramatically improved text-to-speech conversion

  • Digital assistants such as Google Assistant and Amazon Alexa

  • Near-human-level autonomous driving

  • Improved ad targeting, as used by Google, Baidu, or Bing

  • Improved search results on the web

  • Ability to answer natural language questions

  • Superhuman Go playing

According to (Allaire, Kalinowski, and Chollet 2022) … “we shouldn’t believe the short-term hype, but should believe in the long-term vision. It may take a while for AI to be deployed to its true potential—a potential the full extent of which no one has yet dared to dream—but AI is coming, and it will transform our world in a fantastic way”.

With the introduction in place, we can move on to the building blocks of neural networks: perceptrons.

The Artificial Neuron (AN)

McCulloch’s neuron and the perceptron

The first computational model of a neuron was proposed by Warren McCulloch (neuroscientist) and Walter Pitts (logician) in 1943.

The idea was to build a mathematical model that might emulate a neuron, in the sense that, given some inputs and an appropriate set of examples, it would learn to produce the desired output.

It may be divided into two parts. The first part, \(g\), takes the inputs (playing the role of dendrites) and aggregates them; based on the aggregated value, the second part, \(f\), makes a decision. See the source of this picture for an illustration of how this can be used to emulate logical operations such as AND, OR or NOT, but not XOR.
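As a toy sketch (the function name and thresholds here are ours, for illustration only), a McCulloch-Pitts neuron with Boolean inputs can emulate AND and OR simply by choosing the threshold used by \(f\):

Code
```r
# McCulloch-Pitts neuron: g aggregates (sums) the Boolean inputs;
# f fires (returns 1) when the aggregate reaches a threshold.
mp_neuron <- function(x, threshold) {
  as.integer(sum(x) >= threshold)
}

# AND: fires only if all inputs are 1 (threshold = number of inputs)
mp_neuron(c(1, 1), threshold = 2)  # 1
mp_neuron(c(1, 0), threshold = 2)  # 0

# OR: fires if at least one input is 1 (threshold = 1)
mp_neuron(c(0, 1), threshold = 1)  # 1
mp_neuron(c(0, 0), threshold = 1)  # 0
```

No single threshold reproduces XOR, which is one of the limitations of this model.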

This first attempt to emulate neurons succeeded but with limitations:

  • What about non-Boolean (say, real) inputs?

  • What if all inputs are not equal?

  • What if we want to assign more importance to some inputs?

  • What about functions which are not linearly separable? Say XOR function

To overcome these limitations Frank Rosenblatt, an American psychologist, proposed the classical perceptron model, the artificial neuron, in 1958.

It is a more generalized computational model than the McCulloch-Pitts neuron, where weights and thresholds can be learnt over time.

The perceptron proposed by Rosenblatt is very similar to an M-P neuron, but it takes a weighted sum of the inputs and sets the output to one if and only if the sum exceeds an arbitrary threshold (\(\theta\)).

Additionally, instead of hand-coding the thresholding parameter \(\theta\), we add it as one of the inputs, with weight \(w_0=-\theta\), as shown below, which makes it learnable.

Now, while this is an improvement (because both the weights and the threshold can be learned and the inputs can be real values) there is still a drawback in that a single perceptron can only be used to implement linearly separable functions.

Artificial Neural Networks improve on this by introducing Activation Functions which, eventually, can be non-linear.

Neurons and Activation Functions

How do biological and artificial neurons compare?

  • Biological neurons are specialized cells in the central nervous system that transmit electrical and chemical signals to communicate with each other and the rest of the body.

  • On the other hand, artificial neurons are mathematical models used in neural networks to process information.

Neurons do not need to respond to all stimuli. When the stimulus is big enough to produce a response, we say that the neuron has been activated. By extension, one can consider that activation is controlled (or decided) by an activation function.

In both biological and artificial neurons, the activation function is what is responsible for deciding whether the neuron activates or not based on the input it receives.

  • In the case of a biological neuron, the activation function is based on the release of neurotransmitters, which are chemical substances that transmit signals between nerve cells. When the electrical signal reaching the neuron exceeds a certain threshold, the neuron releases neurotransmitters, which are received by other neurons or cells in the body to continue the communication process.
  • On the other hand, in an artificial neuron, the activation function is a mathematical function applied to the neuron’s input to produce an output. Like in the biological neuron, this activation function decides whether the neuron activates or not based on the input it receives.

An activation function is a function that is added into an artificial neuron in order to help it learn complex patterns in the data.

Code
knitr::include_graphics("images/ActivationFunction0.png")

Read more here about activation functions.

With all these inputs in mind we can now define an Artificial Neuron as a computational unit that

  • takes as input \(x=(x_0,x_1,x_2,x_3)\) (\(x_0\) = +1, called bias), and

  • outputs \(h_{\theta}(x) = f(\theta^\intercal x) = f(\sum_i \theta_ix_i)\),

  • where \(f:\mathbb{R}\mapsto \mathbb{R}\) is called the activation function.

The goal of the activation function is to provide the Neuron with the capability of producing the required outputs.

For instance, if the output has to be a probability, the activation function will only produce values between 0 and 1.

With this idea in mind activation functions are chosen from a set of pre-defined functions:

  • the sigmoid function:

\[ f(z)=\frac{1}{1+e^{-z}} \]

  • the hyperbolic tangent, or tanh, function:

\[ f(z)=\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}} \]

The tanh(z) function is a re-scaled version of the sigmoid, and its output range is \([-1,1]\) instead of \([0,1]\).

Two useful properties to recall are that:

  • If \(f(z)=1/(1+e^{-z})\) is the sigmoid function, then its derivative is given by \(f'(z)=f(z)(1-f(z))\).

  • Similarly, if \(f\) is the tanh function, then its derivative is given by \(f'(z)=1-(f(z))^2\).

In modern neural networks, the default recommendation is to use the rectified linear unit or ReLU defined by the activation function \(f(z)=\max\{0,z\}\).

  • This function remains very close to a linear one, in the sense that it is a piece-wise linear function with two linear pieces.

  • Because rectified linear units are nearly linear, they preserve many of the properties that make linear models easy to optimize with gradient based methods.

  • They also preserve many of the properties that make linear models generalize well.
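These activation functions and the two derivative identities above are easy to check numerically in R (the finite-difference helper below is ours; tanh is built into R):

Code
```r
sigmoid <- function(z) 1 / (1 + exp(-z))
relu    <- function(z) pmax(0, z)

# Central finite-difference approximation of a derivative
num_deriv <- function(f, z, h = 1e-6) (f(z + h) - f(z - h)) / (2 * h)

z <- seq(-3, 3, by = 0.5)

# f'(z) = f(z)(1 - f(z)) for the sigmoid
max(abs(num_deriv(sigmoid, z) - sigmoid(z) * (1 - sigmoid(z))))

# f'(z) = 1 - f(z)^2 for tanh
max(abs(num_deriv(tanh, z) - (1 - tanh(z)^2)))

# tanh is a re-scaled sigmoid: tanh(z) = 2 * sigmoid(2z) - 1
max(abs(tanh(z) - (2 * sigmoid(2 * z) - 1)))
```

All three quantities are numerically zero.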


Putting it all together, we have the following schematic representation of an artificial neuron, where \(\Sigma=\left\langle w_{j}, x\right\rangle+ b_{j}\) and \(\left\langle w_{j}, x\right\rangle\) represents the dot product between the vectors \(w_{j}\) and \(x\).

In summary:

  • An artificial neuron takes a vector of input values \(x_{1}, \ldots, x_{d}\) and combines it with some weights that are local to the neuron \(\left(w_{0}, w_{1}, . ., w_{d}\right)\) to compute a net input \(w_{0}+\sum_{i=1}^{d} w_{i} * x_{i}\).

  • To compute its output, it then passes the net input through a possibly non-linear univariate activation function \(g(\cdot)\), usually chosen from a set of options such as the Sigmoid, Tanh or ReLU functions.

  • Given that there is a special weight \(w_{0}\), the bias, in order to treat all weights equally we use the trick of creating an extra input variable \(x_{0}\) with value always equal to 1, so that the function computed by a single artificial neuron (parameterized by its weights \(\mathbf{w}\)) is:

\[ y(\mathbf{x})=g\left(w_{0}+\sum_{i=1}^{d} w_{i} x_{i}\right)=g\left(\sum_{i=0}^{d} w_{i} x_{i}\right)=g\left(\mathbf{w}^{\mathbf{T}} \mathbf{x}\right) \]

  • Single artificial neurons are used to model functions \(\mathbb{R}^{d} \mapsto \mathbb{R}\).
    • The weights \(w_{0}, w_{1}, . ., w_{d}\) are the parameters of the model. Different values will compute different functions.

    • As usual, the learning task is to find suitable weights that allow neurons to model the data as accurately as possible.
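The summary above can be sketched directly in R; the weight values here are arbitrary, chosen for illustration:

Code
```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# A single artificial neuron: prepend x0 = 1 (bias trick) and apply g(w'x)
neuron <- function(x, w, g = sigmoid) {
  g(sum(w * c(1, x)))  # w = (w0, w1, ..., wd); c(1, x) adds the bias input
}

x <- c(0.5, -1.2, 3.0)       # d = 3 input values
w <- c(0.1, 0.8, -0.4, 0.2)  # w0 (bias) plus one weight per input
neuron(x, w)                 # a single number in (0, 1)
```

Different weight vectors make the same neuron compute different functions \(\mathbb{R}^{3}\mapsto\mathbb{R}\).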

From neurons to neural networks

The basic neural networks

Following the brain analogy, one can combine (artificial) neurons to create better learners.

A simple artificial neural network is usually created by combining two types of modifications to the basic perceptron (AN).

  • Stacking several neurons instead of just one.
  • Adding an additional layer of neurons, which is called a hidden layer.

This yields a system where the output of one neuron can be the input of another, in many different ways.

The image below shows a small neural network with three neurons stacked in a hidden layer.

In this figure, we have used circles to also denote the inputs to the network.

  • The circles labeled +1 are called bias units, and correspond to the intercept term.

  • The leftmost layer of the network is called the input layer.

  • The rightmost layer of the network is called the output layer.

  • The middle layer of nodes is called the hidden layer, because its values are not observed in the training set.

Bias nodes are not counted when stating the size of a layer.

With all this in mind our example neural network has three layers with:

  • 3 input units (not counting the bias unit),
  • 3 hidden units, and
  • 1 output unit.

An ANN is a predictive model (a learner) whose properties and behaviour can be well characterized.

  • The ANN operates through a process known as forward propagation, which encompasses the entire journey of information flow from the input layer to the output layer.

  • Forward propagation is performed by composing a series of linear and non-linear (activation) functions.

  • These are characterized (parametrized) by their weights and biases, that need to be learnt. This is done by training the ANN.

  • In order for the ANN to perform well, the training process aims at finding the best possible parameter values for the learning task defined by these functions. This is done by

    • Selecting an appropriate (convex) loss function,
    • Finding those weights that minimize the total cost function (avg. loss).
  • This is usually done using some iterative optimization procedure such as gradient descent.

    • This requires evaluating derivatives in a huge number of points.
    • Such high number may be reduced by Stochastic Gradient Descent.
    • The evaluation of derivatives is simplified thanks to Backpropagation.

A logistic regression ANN

The simple ANN shown above can be cast to perform a logistic regression:

  • The transformation of the inputs through the hidden layer yields a new set of complex features.

  • Such complex features may be transformed using a sigmoid activation function to produce the output, which takes the form: \[ \mbox{The output is: }h_{\theta}(x)=\frac{1}{1+e^{-\theta^\intercal x}} \]

Now, recall that, the logistic regression model is:

\[ \log\frac{p(Y=1|x)}{1-p(Y=1|x)}=\theta^\intercal x \]

By exponentiating both sides and then isolating \(p(Y=1|x)\), we have:

\[\begin{eqnarray} \frac{p(Y=1|x)}{1-p(Y=1|x)}&=&e^{\theta^\intercal x}\\ p(Y=1|x)&=& e^{\theta^\intercal x}\cdot(1-p(Y=1|x))\\ p(Y=1|x)[1+e^{\theta^\intercal x}]&=& e^{\theta^\intercal x}\\ p(Y=1|x)&=&\frac{e^{\theta^\intercal x}}{1+e^{\theta^\intercal x}}=\frac{1}{1+e^{-\theta^\intercal x}} \end{eqnarray}\]

That is, when the activation function of the output node is the sigmoid, the output coincides with a logistic regression on complex features.

  • And, \(h_{\theta}(x)\), the output of the ANN, estimates \(p(Y=1|x)\).
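As a numerical sanity check (the coefficients below are made up), the sigmoid output node computes exactly the logistic c.d.f., which R exposes as plogis and which underlies logistic regression:

Code
```r
theta <- c(0.5, -1.0, 2.0)  # illustrative parameters, incl. the intercept
x     <- c(1, 0.3, -0.7)    # x0 = 1 plays the role of the bias/intercept

z <- sum(theta * x)         # theta' x
h <- 1 / (1 + exp(-z))      # h_theta(x), the ANN output

all.equal(h, plogis(z))     # TRUE: the sigmoid is the logistic c.d.f.
```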

The parameters of an ANN

  • Let \(n_l\) denote the number of layers in our network, thus \(n_l=3\) in our example.

  • Label layer \(l\) as \(L_l\), so layer \(L_1\) is the input layer, and layer \(L_{n_l}=L_3\) the output layer.

  • Our neural network has parameters: \(\Theta=(\Theta^{(1)},\Theta^{(2)})\), where we will write \(\theta^{(l)}_{ij}\) to denote the parameter (or weight) associated with the connection between unit \(j\) in layer \(l\), and unit \(i\) in layer \(l+1\).

  • Thus, in our example, we have:

    • \(\Theta^{(1)}\in\mathbb{R}^{3\times 4}\), and
    • \(\Theta^{(2)}\in\mathbb{R}^{1\times 4}\).

Note that bias units don’t have inputs or connections going into them, since they always output the value +1.

We also let \(s_l\) denote the number of nodes in layer \(l\) (not counting the bias unit).

Now, write \(a^{(l)}_i\) to denote the activation (meaning output value) of unit \(i\) in layer \(l\).

For \(l=1\), we also use \(a^{(1)}_i=x_i\) to denote the \(i\)-th input.

Given a fixed setting of the parameters \(\Theta\), our neural network defines a model \(h_{\Theta}(x)\) that outputs a real number.

Forward propagation

As described above the process that encompasses the computations required to go from the input values to the final output is known as forward propagation.

To understand how it works let’s check how the weights are combined with the input to produce the final output.

Each node \(a_i^{(2)}\) of the hidden layer operates on all the input values as follows:

\[\begin{eqnarray} a_1^{(2)}&=&f(\theta_{10}^{(1)}+\theta_{11}^{(1)}x_1+\theta_{12}^{(1)}x_2+\theta_{13}^{(1)}x_3)\\ a_2^{(2)}&=&f(\theta_{20}^{(1)}+\theta_{21}^{(1)}x_1+\theta_{22}^{(1)}x_2+\theta_{23}^{(1)}x_3)\\ a_3^{(2)}&=&f(\theta_{30}^{(1)}+\theta_{31}^{(1)}x_1+\theta_{32}^{(1)}x_2+\theta_{33}^{(1)}x_3) \end{eqnarray}\]

The hidden-layer outputs are then combined and passed through the activation function of the output node:

\[ h_{\Theta}(x)=a_1^{(3)}=f(\theta_{10}^{(2)}+\theta_{11}^{(2)}a_1^{(2)}+\theta_{12}^{(2)}a_2^{(2)}+\theta_{13}^{(2)}a_3^{(2)}) \]

A more compact notation

If we let \(z_i^{(l)}\) denote the total weighted sum of inputs to unit \(i\) in layer \(l\), including the bias term:

\[ z_i^{(2)}=\theta_{i0}^{(1)}+\theta_{i1}^{(1)}x_1+\theta_{i2}^{(1)}x_2+\theta_{i3}^{(1)}x_3, \] the output becomes: \(a_i^{(l)}=f(z_i^{(l)})\).

Extending the activation function \(f(\cdot)\) to apply to vectors in an element-wise fashion:

\[ f([z_1,z_2,z_3]) = [f(z_1), f(z_2),f(z_3)], \] we can write the previous equations more compactly as:

\[\begin{eqnarray} z^{(2)}&=&\Theta^{(1)}x\nonumber\\ a^{(2)}&=&f(z^{(2)})\nonumber\\ z^{(3)}&=&\Theta^{(2)}a^{(2)}\nonumber\\ h_{\Theta}(x)&=&a^{(3)}=f(z^{(3)})\nonumber \end{eqnarray}\]

More generally, recalling that we write \(a^{(1)}=x\) for the values of the input layer,

Given layer \(l\)’s activations \(a^{(l)}\), we can compute layer \(l+1\)’s activations \(a^{(l+1)}\) as:

\[\begin{equation} z^{(l+1)}=\Theta^{(l)}a^{(l)} \label{eqforZs} \end{equation}\]

\[\begin{equation} a^{(l+1)}=f(z^{(l+1)}) \label{eqforAs} \end{equation}\]

This can be used to provide a matrix representation for the weighted sum of inputs of all neurons:

\[ z^{(l+1)}= \begin{bmatrix} z_1^{(l+1)}\\ z_2^{(l+1)}\\ \vdots\\ z_{s_{l+1}}^{(l+1)} \end{bmatrix}= \begin{bmatrix} \theta_{10}^{(l)}& \theta_{11}^{(l)}&\theta_{12}^{(l)}&\dots&\theta_{1s_{l}}^{(l)}\\ \theta_{20}^{(l)}& \theta_{21}^{(l)}&\theta_{22}^{(l)}&\dots&\theta_{2s_{l}}^{(l)}\\ \vdots & \vdots& \vdots & \ddots & \vdots\\ \theta_{s_{l+1}0}^{(l)}& \theta_{s_{l+1}1}^{(l)}&\theta_{s_{l+1}2}^{(l)}&\dots&\theta_{s_{l+1}s_{l}}^{(l)} \end{bmatrix} \cdot\begin{bmatrix} 1\\ a_1^{(l)}\\ a_2^{(l)}\\ \vdots\\ a_{s_l}^{(l)} \end{bmatrix} \]

The activations are then:

\[ a^{(l+1)}= \begin{bmatrix} a_1^{(l+1)}\\ a_2^{(l+1)}\\ \vdots\\ a_{s_{l+1}}^{(l+1)} \end{bmatrix}=f(z^{(l+1)})=\begin{bmatrix} f(z_1^{(l+1)})\\ f(z_2^{(l+1)})\\ \vdots\\ f(z_{s_{l+1}}^{(l+1)}) \end{bmatrix} \]
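For the 3-3-1 network of the example, forward propagation in this compact notation reduces to two matrix products; the weights below are random, just to show the mechanics:

Code
```r
sigmoid <- function(z) 1 / (1 + exp(-z))

set.seed(1)
Theta1 <- matrix(rnorm(3 * 4), nrow = 3)  # 3 hidden units x (3 inputs + bias)
Theta2 <- matrix(rnorm(1 * 4), nrow = 1)  # 1 output unit x (3 hidden + bias)

forward <- function(x, Theta1, Theta2, f = sigmoid) {
  a1 <- c(1, x)          # a^(1): input layer plus bias unit
  z2 <- Theta1 %*% a1    # z^(2) = Theta^(1) a^(1)
  a2 <- c(1, f(z2))      # a^(2) = f(z^(2)), plus bias unit
  z3 <- Theta2 %*% a2    # z^(3) = Theta^(2) a^(2)
  f(z3)                  # h_Theta(x) = a^(3)
}

forward(c(0.2, -0.5, 1.0), Theta1, Theta2)
```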

Multiple architectures for ANN

  • We have so far focused on the single-hidden-layer neural network of the example.

  • Indeed, one can build neural networks with many distinct architectures (meaning patterns of connectivity between neurons), including ones with multiple hidden layers.

  • See here The Neural Network Zoo.

Feed Forward Neural networks

The most common choice is a \(n_l\)-layered network where layer 1 is the input layer, layer \(n_l\) is the output layer, and each layer \(l\) is densely connected to layer \(l+1\).

In this setting, to compute the output of the network, we can successively compute all the activations in layer \(L_2\), then layer \(L_3\), and so on, up to layer \(L_{nl}\) , using the equations above. This is one example of a feed-forward neural network (FFNN), since the connectivity graph does not have any directed loops or cycles.

The number of output units

Neural networks can also have multiple output units.

For example, in the image below we can see a network with two hidden layers, \(L_2\) and \(L_3\), and four output units in layer \(L_4\); the bias units of each layer have been omitted.

Neural network

To train this network, we would need training examples \((x^{(i)},y^{(i)})\) where \(y^{(i)}\in\mathbb{R}^4\).

This type of network is useful if there are multiple outputs that you’re interested in predicting.

  • For example, in a medical diagnosis application, the vector \(x\) might give the input features of a patient, and the different outputs \(y_i\)’s might indicate presence or absence of different diseases.

Training Neural Networks

  • An ANN is a predictive model whose properties and behaviour can be mathematically characterized.

  • In practice this means:

    • The ANN acts by composing a series of linear and non-linear (activation) functions.
    • These are characterized by their weights and biases, that need to be learnt.
  • Training the network is done by

    • Selecting an appropriate (convex) loss function,
    • Finding those weights that minimize the total cost function (avg. loss).

The tools for training

  • Training an ANN is usually done using some iterative optimization procedure such as Gradient Descent.

  • This requires evaluating derivatives in a huge number of points.

    • Such high number may be reduced by Stochastic Gradient Descent.
    • The evaluation of derivatives is simplified thanks to Backpropagation.
  • It is important to realize that backpropagation is both

    • The algorithm for training, i.e. for tuning the weights in order to minimize the loss function
    • A way to train efficiently thanks to a series of tricks and techniques such as automatic differentiation.

A loss function for optimization

A first idea may be to use squared error loss:

\[ l(h_\theta(x),y)=\left (y-\frac{1}{1+e^{-\theta^\intercal x}}\right )^2 \]

However, it turns out that, with a sigmoid activation function, the resulting optimization problem is not convex, which means that MSE is not an appropriate loss.

Quadratic loss may be used, for instance, with ReLU activation.

To illustrate this, we can consider a simple neural network with one input, one hidden layer with a logistic activation function, and one output neuron.

We aim to minimize the squared error loss function with respect to the weights of the network. Here’s how we can visualize the loss surface using R:

Code
library(plot3D)

# Define the squared error loss function
squared_error <- function(y, y_hat) {
  return(0.5 * (y - y_hat)^2)
}

# Define the logistic activation function
logistic <- function(z) {
  return(1 / (1 + exp(-z)))
}

# Generate data
x <- seq(-10, 10, length.out = 200)
y <- seq(-10, 10, length.out = 200)
z <- outer(x, y, FUN = function(x, y) squared_error(1, logistic(x + y)))

# Scale the z values to exaggerate the differences
z <- z * 500

# Plot the exaggerated loss surface
persp3D(x, y, z, theta = 30, phi = 30, expand = 0.5, col = "lightblue", ticktype = "detailed", xlab = "Weight 1", ylab = "Weight 2", zlab = "Loss")

This plot illustrates the loss surface of a neural network with two weights to optimize. As can be seen, the surface is not convex, indicating the presence of multiple local minima.

An alternative to squared error loss is the binary cross-entropy loss function :

\[ l(h_\theta(x),y)=\left\{\begin{array}{ll} -\log h_\theta(x) & \textrm{if }y=1\\ -\log(1-h_\theta(x))& \textrm{if }y=0 \end{array}\right. \]

This function can also be written as:

\[ l(h_\theta(x),y)=-y\log h_\theta(x) - (1-y)\log(1-h_\theta(x)) \]

Using cross-entropy loss, the cost function is of the form:

\[\begin{eqnarray*} J(\theta)=-\frac{1}{n}\big[\sum_{i=1}^n&&(y^{(i)}\log h_\theta(x^{(i)})+\\ &&(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\big] \end{eqnarray*}\]

Now, this is a convex optimization problem.
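A quick numerical illustration of this loss (the predicted probabilities below are made up): it is near zero when a confident prediction agrees with the label, and grows without bound when it contradicts it:

Code
```r
# Binary cross-entropy loss for a single observation
cross_entropy <- function(h, y) {
  -y * log(h) - (1 - y) * log(1 - h)
}

cross_entropy(h = 0.99, y = 1)  # confident and right: ~0.01
cross_entropy(h = 0.01, y = 1)  # confident and wrong: ~4.6
cross_entropy(h = 0.01, y = 0)  # confident and right: ~0.01
```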

In practice we often work with a regularized version of the cost function (we don’t regularize the bias units)

\[\begin{eqnarray*} J(\Theta)&=&-\frac{1}{n}\big[\sum_{i=1}^n \sum_{k=1}^K y_k^{(i)}\log( h_\theta(x^{(i)}))_k\\ &+&(1-y_k^{(i)})\log(1-(h_\theta(x^{(i)}))_k)\big]\\ &+&\lambda\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}} (\theta_{ji}^{(l)})^2 \end{eqnarray*}\]

Parameter optimizations with Gradient Descent

We saw in the previous section that training a network corresponds to choosing the parameters, that is, the weights and biases, that minimize the cost function.

The weights and biases take the form of matrices and vectors, but at this stage it is convenient to imagine them stored as a single vector that we call \(\theta\). Generally, we will suppose \(\theta\in\mathbb{R}^p\), and write the cost function as \(J(\theta)\) to emphasize its dependence on the parameters. So Cost \(J: \mathbb{R}^p\rightarrow \mathbb{R}\).

Error hyper-surface

Gradient Descent is a classical method in optimization that is often named by one of its variants referred to as steepest descent.

Given a function \(J(\theta)\) to be minimized, the method proceeds iteratively, computing a sequence of vectors \(\theta^1, \theta^2, ..., \theta^n\) in \(\mathbb{R}^p\) with the aim of converging to a vector that minimizes the cost function.

Suppose that our current vector is \(\theta\). How should we choose a perturbation, \(\Delta\theta\), so that the next vector, \(\theta+\Delta\theta\), represents an improvement, that is: \(J(\theta +\Delta\theta) < J(\theta)\)?

As is often the case, we will proceed by linearization of the cost function using a Taylor series expansion.

If \(\Delta\theta\) is small, then ignoring terms of order \(||\Delta\theta||^2\) or higher, a Taylor series expansion gives:

\[ J(\theta+\Delta\theta)\approx J(\theta)+\sum_{i=1}^p\frac{\partial J(\theta)}{\partial\theta_i}\Delta\theta_i \] where \({\displaystyle \frac{\partial J(\theta)}{\partial\theta_i}}\) denotes the partial derivative of the cost function with respect to the \(i\)-th weight.

For convenience, we will let \(\nabla J(\theta)\in\mathbb{R}^p\) denote the vector of partial derivatives, known as the gradient, so that

\[\begin{equation}\label{g1} \nabla J(\theta)=\left(\frac{\partial J(\theta)}{\partial\theta_1},...,\frac{\partial J(\theta)}{\partial\theta_p}\right)^\intercal \end{equation}\]

Now the approximation can be written as:

\[\begin{equation}\label{g2} J(\theta+\Delta\theta)\approx J(\theta)+\nabla J(\theta)^\intercal\Delta\theta \end{equation}\]

Recalling that our aim is to reduce the value of the cost function, the Taylor approximation above motivates the idea of choosing \(\Delta\theta\) to make \(\nabla J(\theta)^\intercal\Delta\theta\) negative, because this will make the value of \(J(\theta+\Delta\theta)\) smaller.

The bigger in absolute value we can make this negative expression, the smaller will be the value of the cost function.

Now, how much is it possible to make \(\nabla J(\theta)^\intercal\Delta\theta\) decrease?

The answer to this question can be motivated using the Cauchy-Schwarz inequality.

Interlude: Cauchy-Schwarz

The Cauchy-Schwarz inequality states that for any \(f,g\in\mathbb{R}^p\), we have: \[ |f^\intercal g|\leq ||f||\cdot ||g||. \] Moreover, the two sides are equal if and only if \(f\) and \(g\) are linearly dependent (meaning they are parallel).

  • How much can \(\nabla J(\theta)^\intercal\Delta\theta\) decrease?

  • By Cauchy-Schwarz, the biggest possible absolute value of \(\nabla J(\theta)^\intercal\Delta\theta\) is the upper bound \(||\nabla J(\theta)||\cdot ||\Delta\theta||\).

    • For a fixed step length \(||\Delta\theta||\), this bound is attained only when \(\Delta\theta\) is parallel to \(\nabla J(\theta)\).
    • The most negative value is therefore obtained when \(\Delta\theta\) points in the direction of \(-\nabla J(\theta)\).
  • That is, we should choose \(\Delta\theta\) to lie in the direction of \(-\nabla J(\theta)\).

Going back to gradient

Keeping in mind that the Taylor linearization of \(J(\theta)\) is an approximation that is relevant only for small \(\Delta\theta\), we will limit ourselves to a small step in that direction. This leads to the update:

\[\begin{equation}\label{g3} \theta \rightarrow \theta-\eta\nabla J(\theta) \end{equation}\]


Here \(\eta\) is a small step size that, in this context, is known as the learning rate.

In summary, given a cost function \(J(\theta)\) to be minimized, gradient descent proceeds as follows:

  1. Initialize \(\theta_0\) randomly or with some predetermined values
  2. Repeat until convergence: \[ \theta_{t+1} = \theta_{t} - \eta \nabla J(\theta_{t}) \]
  3. Stop when: \(|J(\theta_{t+1}) - J(\theta_{t})| < \epsilon\)

Where:

  • \(\theta_0\) is the initial parameter vector,
  • \(\theta_t\) is the parameter vector at iteration \(t\),
  • \(\eta\) is the learning rate,
  • \(\nabla J(\theta_{t})\) is the gradient of the loss function with respect to \(\theta\) at iteration \(t\),
  • \(\epsilon\) is a small positive value indicating the desired level of convergence.
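The scheme above can be sketched on a simple convex cost with a known minimum; here we take \(J(\theta)=(\theta_1-3)^2+(\theta_2+1)^2\), minimized at \((3,-1)\), with its gradient coded by hand:

Code
```r
J     <- function(theta) (theta[1] - 3)^2 + (theta[2] + 1)^2
gradJ <- function(theta) c(2 * (theta[1] - 3), 2 * (theta[2] + 1))

gradient_descent <- function(theta, eta = 0.1, eps = 1e-10, max_iter = 10000) {
  for (t in seq_len(max_iter)) {
    theta_new <- theta - eta * gradJ(theta)        # gradient step
    if (abs(J(theta_new) - J(theta)) < eps) break  # convergence criterion
    theta <- theta_new
  }
  theta_new
}

gradient_descent(c(0, 0))  # converges close to (3, -1)
```

In a neural network, \(\nabla J\) is not coded by hand; it is computed by backpropagation, as discussed below.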

Computing derivatives

In order to use gradient descent, we need to compute \(J(\theta)\) and the partial derivative terms \[ \frac{\partial}{\partial\theta_j}J(\theta) \]

Initialization

The input data have to be normalized to have approximately the same range.

The biases can be initialized to 0. The weights, however, cannot all be initialized with the same value; otherwise, all the neurons of a hidden layer would have the same behavior.

Perhaps the only property known with complete certainty is that the initial parameters need to break symmetry between different units.

We generally initialize the weights at random: the values \(\theta_{ij}^{(l)}\) are i.i.d. Uniform on \([-c,c]\) with possibly \(c= 1/\sqrt{N_l}\) where \(N_l\) is the size of the hidden layer \(l\).

We also sometimes initialize the weights with a normal distribution \(N(0,0.01)\).
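Both initialization schemes are one-liners in R (the layer sizes below are arbitrary; we take \(N_l\) to be the fan-in of the layer, and read \(N(0,0.01)\) as a standard deviation of 0.01):

Code
```r
set.seed(42)

# Uniform initialization on [-c, c] with c = 1 / sqrt(N_l)
init_uniform <- function(n_out, n_in) {
  c_bound <- 1 / sqrt(n_in)
  matrix(runif(n_out * n_in, min = -c_bound, max = c_bound), nrow = n_out)
}

# Alternative: small Gaussian weights
init_normal <- function(n_out, n_in) {
  matrix(rnorm(n_out * n_in, mean = 0, sd = 0.01), nrow = n_out)
}

W1 <- init_uniform(3, 4)  # e.g. 3 hidden units, 3 inputs + bias
range(W1)                 # all values within [-0.5, 0.5], since sqrt(4) = 2
```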

An example

In this example we train and use a “shallow neural network”, so called in contrast with “deep neural networks”.

We will use the neuralnet R package, which is not intended for deep neural networks, to build a simple network that predicts whether a stock pays dividends or not.

Code
if (!require(neuralnet)) 
  install.packages("neuralnet", dep=TRUE)
if (!require(caret)) 
  install.packages("caret", dep=TRUE)

The data for the example are the dividendinfo.csv dataset, available from: https://github.com/MGCodesandStats/datasets

Code
mydata <- read.csv("https://raw.githubusercontent.com/MGCodesandStats/datasets/master/dividendinfo.csv")
str(mydata)
'data.frame':   200 obs. of  6 variables:
 $ dividend       : int  0 1 1 0 1 1 1 0 1 1 ...
 $ fcfps          : num  2.75 4.96 2.78 0.43 2.94 3.9 1.09 2.32 2.5 4.46 ...
 $ earnings_growth: num  -19.25 0.83 1.09 12.97 2.44 ...
 $ de             : num  1.11 1.09 0.19 1.7 1.83 0.46 2.32 3.34 3.15 3.33 ...
 $ mcap           : int  545 630 562 388 684 621 656 351 658 330 ...
 $ current_ratio  : num  0.924 1.469 1.976 1.942 2.487 ...

Data pre-processing

One of the most important procedures when forming a neural network is data normalization. This involves adjusting the data to a common scale so as to accurately compare predicted and actual values. Failure to normalize the data will typically result in the prediction value remaining the same across all observations, regardless of the input values.

We can do this in two ways in R:

  • Scale the data frame automatically using the scale function in R
  • Transform the data using a max-min normalization technique

In this example we implement the max-min normalization technique.

See this link for further details on how to use the normalization function.

Code
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
normData <- as.data.frame(lapply(mydata, normalize))

As usual, the dataset is split into a training set and a test set. The training set contains a random selection of an (arbitrary) two thirds of the observations.

Code
perc2Train <- 2/3
ssize <- nrow(normData)
set.seed(12345)
data_rows <- floor(perc2Train *ssize)
train_indices <- sample(c(1:ssize), data_rows)
trainset <- normData[train_indices,]
testset <- normData[-train_indices,]

The trainset data will be used to train the network and the testset data to evaluate it.

Training a neural network

Setting the parameters of a neural network requires experience and an understanding of their meaning; even so, different parameter choices can lead to similar results.

We create a simple NN with two hidden layers, with 3 and 2 neurons respectively. This is specified in the hidden parameter. For other parameters see the package help.

Code
# Neural Network
library(neuralnet)
nn <- neuralnet(dividend ~ fcfps + earnings_growth + de + mcap + current_ratio, 
                data=trainset, 
                hidden=c(3,2), 
                linear.output=FALSE, 
                threshold=0.01)

The output of the procedure is a neural network with estimated weights.

The fitted network can be visualized with the plot function (the rep argument selects which training repetition to display; "best" picks the one with the smallest error).

Code
plot(nn, rep = "best")

The object nn contains information on the weights and the results, although it is not particularly clear or useful.

Code
summary(nn)
                    Length Class      Mode    
call                  6    -none-     call    
response            133    -none-     numeric 
covariate           665    -none-     numeric 
model.list            2    -none-     list    
err.fct               1    -none-     function
act.fct               1    -none-     function
linear.output         1    -none-     logical 
data                  6    data.frame list    
exclude               0    -none-     NULL    
net.result            1    -none-     list    
weights               1    -none-     list    
generalized.weights   1    -none-     list    
startweights          1    -none-     list    
result.matrix        32    -none-     numeric 
Code
nn$result.matrix
                                     [,1]
error                        5.096531e-01
reached.threshold            9.874263e-03
steps                        1.798000e+04
Intercept.to.1layhid1       -1.243872e+00
fcfps.to.1layhid1           -1.349137e-01
earnings_growth.to.1layhid1  3.151554e+00
de.to.1layhid1              -5.249806e+00
mcap.to.1layhid1             9.908495e-01
current_ratio.to.1layhid1    6.527535e+00
Intercept.to.1layhid2        1.660208e+00
fcfps.to.1layhid2           -2.401517e-01
earnings_growth.to.1layhid2 -1.385771e+00
de.to.1layhid2               7.682849e-01
mcap.to.1layhid2            -4.058053e+00
current_ratio.to.1layhid2   -2.855816e+00
Intercept.to.1layhid3        2.982002e+00
fcfps.to.1layhid3           -2.877651e+00
earnings_growth.to.1layhid3 -6.957763e-02
de.to.1layhid3              -2.965334e+00
mcap.to.1layhid3            -5.034300e+00
current_ratio.to.1layhid3   -1.086037e+00
Intercept.to.2layhid1        9.282087e-02
1layhid1.to.2layhid1        -2.341614e+00
1layhid2.to.2layhid1         3.001315e+00
1layhid3.to.2layhid1         5.107051e+00
Intercept.to.2layhid2       -4.188729e-02
1layhid1.to.2layhid2         3.029232e+00
1layhid2.to.2layhid2        -4.732821e+00
1layhid3.to.2layhid2        -9.017001e+00
Intercept.to.dividend       -3.761263e-01
2layhid1.to.dividend        -3.054146e+02
2layhid2.to.dividend         1.494655e+02

Model evaluation

A prediction for each value in the testset dataset can be built with the compute function.

Code
#Test the resulting output
temp_test <- subset(testset, select =
                      c("fcfps","earnings_growth", 
                        "de", "mcap", "current_ratio"))
head(temp_test)
       fcfps earnings_growth        de       mcap current_ratio
9  0.4929006      0.52417860 0.7862595 0.79741379   0.662994637
19 0.8722110      0.89705139 0.5190840 0.31465517   0.631284474
22 0.0811359      0.68272957 0.4554707 0.05747126   0.000785556
26 0.4077079      0.07649537 0.6310433 0.70977011   0.379642293
27 0.4279919      0.70362258 0.1882952 0.30603448   0.628283435
29 0.3509128      0.74203875 0.6030534 0.53017241   0.543404499
Code
nn.results <- compute(nn, temp_test)
results <- data.frame(actual = 
                  testset$dividend, 
                  prediction = nn.results$net.result)
head(results)
   actual    prediction
9       1  1.000000e+00
19      1  1.000000e+00
22      0 5.442517e-133
26      0  6.801894e-35
27      1  4.548179e-10
29      1  1.000000e+00

A confusion matrix can be built to evaluate the predictive ability of the network:

Code
roundedresults<-sapply(results,round,digits=0)
roundedresultsdf=data.frame(roundedresults)
attach(roundedresultsdf)
confMat<- caret::confusionMatrix(table(actual, prediction))
confMat
Confusion Matrix and Statistics

      prediction
actual  0  1
     0 34  2
     1  6 25
                                         
               Accuracy : 0.8806         
                 95% CI : (0.7782, 0.947)
    No Information Rate : 0.597          
    P-Value [Acc > NIR] : 3.405e-07      
                                         
                  Kappa : 0.7577         
                                         
 Mcnemar's Test P-Value : 0.2888         
                                         
            Sensitivity : 0.8500         
            Specificity : 0.9259         
         Pos Pred Value : 0.9444         
         Neg Pred Value : 0.8065         
             Prevalence : 0.5970         
         Detection Rate : 0.5075         
   Detection Prevalence : 0.5373         
      Balanced Accuracy : 0.8880         
                                         
       'Positive' Class : 0              
                                         

References and resources

Allaire, JJ, Tomasz Kalinowski, and Francois Chollet. 2022. “Deep Learning with R, Second Edition.” https://www.manning.com/books/deep-learning-with-r-second-edition.